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A Bayesian Approach to Sparse plus Low rank Network Identification 


Mattia Zorzi and Alessandro Chiuso 


Abstract — We consider the problem of modeling multivariate 
stochastic processes with parsimonious dynamical models which 
can be represented with a sparse dynamic network with few 
latent nodes. This structnre translates into a sparse plus low 
rank model. In this paper, we propose a Bayesian approach to 
identify snch models. 

1. INTRODUCTION 

This paper deals with the network identification problem 
of a multivariate stochastic process of high dimension. Im¬ 
portant applications can be found in many fields, for instance 
in econometrics, [1], [5], [8], [21], social network analysis 
[13], system biology, [20] and so on. 

Let y be a stochastic process whose components (or 
variables) yk are manifest, i.e. can be directly measured. We 
define as network of this manifest process y a directed graph 
wherein nodes denote the variables and edges encode con¬ 
ditional Granger causality relations among these variables, 
[12]. More specifically, there is an edge from node i (i.e. 
variable yi) to node j (i.e. variable yj) if the past of yi is 
needed to predict yj conditioned on the past of all the other 
yk, k ^ i. Sparse graphs (i.e. with few edges) represent 
a concise and interpretable way to describe the relations 
among the variables of the manifest process, [7]. However, 
this modeling assumption does not exploit nor encode the 
fact that the manifest process may often be thought as driven 
by a low dimensional latent process, i.e. a stochastic process 
whose components cannot be directly measured. 

In this paper we formulate a new network identification 
problem which takes into account the presence of this low 
dimensional latent process and the variables of the manifest 
process Granger causes each other mostly through the latent 
variables. The corresponding network has a two layer struc¬ 
ture: one layer denotes the manifest variables and the other 
one the latent variables. The presence of few latent variables 
should drastically reduce the edges in the manifest layer, 
therefore increasing the degree of conciseness and robustness 
of the model. Finally, it turns out that a model described 
by this network has a sparse plus low rank (Sh-L) structure. 
More precisely, the low rank part depends on the number of 
latent variables, whereas the sparse one on the number of 
conditional Granger causality relations among the manifest 
variables. 
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When it comes to developing and identification algorithm, 
which maps measured data into an estimated dynamical 
model, it is fair to say that the prediction error method 
(PEM) is a consolidate paradigm in system identification 
[15], [19]. In the traditional setting, candidate models are 
described in fixed parametric model structures, e.g. ARMAX, 
whose complexity is determined using cross validation or 
information based criteria. Regularization has been recently 
introduced in the PEM framework, see [16], [17], [6], [18], 
as an alternative approach to control complexity of the 
estimated models. This latter class of methods start with a 
large enough (in principle infinite dimensional) model class; 
the inverse (ill-posed) problem of determining a specific 
model from a finite set of measured data can be made into a 
well posed problem using a penalty term, whose duty is to 
select models with specific features. In the Bayesian view, 
this is equivalent to the introduction of an a priori probability 
(i.e. prior) on the model to estimate. Eor instance, the prior 
should account the fact that the model is Bounded Input 
Bounded Output (BIBO) stable, [17]. 

The identification algorithm we propose belongs to this 
latter class of methods, where the predictor impulse response 
is estimated in the framework of Gaussian regression. The 
predictor impulse response is modeled as a zero mean 
Gaussian random vector. Its covariance matrix, referred to 
as kernel matrix, encodes the a priori information. In this 
case, the a priori information is that the predictor impulse 
responses are BIBO stable and that the model has a Sh-L 
structure. As we will see, such a priori information can be 
encoded in the kernel matrix using the maximum entropy 
principle. Moreover, this kernel matrix is characterized by the 
decay rate of the predictor impulse responses, by the number 
of conditional Granger causality relations among the mani¬ 
fest variables, and by the number of latent variables. These 
features are tuned by the so called hyperparameters vector. 
It is estimated by minimizing the negative log-likelihood 
of the measured data. Beside the fact that this problem 
is nonconvex, the joint estimation of the hyperparameters 
tuning the sparse and low rank part is not trivial, because 
these two parts may be nonidentifiable from the measured 
data. We propose an algorithm to estimate these hyperpa¬ 
rameters which imposes and “hyper-regularizer” on the low 
rank hyperparameter to handle partially this non-uniqueness. 
Once the kernel matrix is fixed, an unique estimate of the 
Sh-L model is guaranteed through regularization. 

We warn the reader that the present paper only reports 
some preliminary result regarding the Bayesian estimation 
of Sh-L models. In particular, all the proofs and most of the 
technical assumptions needed therein are omitted and will be 



published afterwards. 

The outline of the paper follows. In Section [I^ we intro¬ 
duce the Sh-L model for multivariate stochastic processes. 


In Section III we apply the Gaussian approach to Sh-L 
system identihcation. In Section |lvl we derive the kernel 
matrix by the maximum entropy principle. In Section |Vj we 
present the algorithm to estimate the hyperparameters of the 


kernel matrix. In Section VI we provide some numerical 


examples to show the effectiveness of our method. Finally, 
the conclusions are in Section Ivnl 


Notation 


xi conditionally Granger causes all the components in y and 
conversely, and yi conditionally Granger causes j/ 5 . 



Throughout the paper, we will use the following notation. 
5+ denotes the cone of the positive dehnite symmetric 
matrices, and its closure. Given v G M"* and G G 
Vi denotes the i-th entry of v and [G]ij denotes the entry of 
G in position IIGIjg denotes the weighted Frobenius 

norm of G with weight matrix Q G Given a transfer 
matrix L{z) of dimension m x m, with some abuse of 
terminology, we say that L{z) has rank equal to n, with 
n < m, if it admits the decomposition L{z) = FH{z) 
where F G and iT(z) is a n x m transfer matrix. 

Given a stochastic process y = {y(t)}tez, with some abuse 
of notation, y(t) will both denote a random vector and its 
sample value. Finally, 

y~(i) := [ (1) 


Fig. 1. Example of a sparse plus low rank model with m = 6 and n = 1. 

Our main modeling assumption in is that manifest 
variables Granger cause each other mostly through few latent 
variables. Therefore, we have n m, i.e. the number of 
latent variables in x is small as compared to the number of 
manifest variables in y\ similarly S{z) is sparse, i.e. many 
of its entries are null transfer functions, so that few manifest 
variables conditionally Granger causes each other. From Q, 
we obtain the sparse plus low rank (Sh-L) model for y. 

y{t) = S{z)y[t) + L{z)y{t) + e{t) (3) 


denotes the past data vector of y at time t. In similar way, 
y~ {t) denotes the past data vector of yi. 

11. Sh-L Models 

Consider two zero mean stationary Gaussian stochastic 
processes y and x of dimension m and n, respectively. Let y 
be manifest, i.e. it can be measured, and x latent, i.e. it cannot 
be measured. We assume that y{t) G K"* and x(t) G K" are 
described by the model 

y{t) = Fx{t) + S{z)y{t)+v{t) 

x{t) = H(z)y{t) +w{t) (2) 

where H{z) = HkZ~^ is a BIBO stable transfer ma¬ 
trix of dimension n x m, F G S{z) = 

is a BIBO stable transfer matrix of dimension mxm,v and w 
are, respectively, m and n dimensional white Gaussian noise 
(WGN) with zero mean, and covariance matrix and 
Moreover, v and w are independent. 

It is possible to describe the structure of model Q using 
a directed graph (i.e. network) as in Figure [T] [14]. Each 
node of this network corresponds to components of y or x; 
edges encode conditional Granger causality relations. More 
precisely, there is a direct link from node yj (xj) to node yi 
(xi) if and only if yJ (t) (Xj{t)) is needed to predict yi{t) 
{xi{t)), conditionally on all the past information available 
at time t, i.e. y^(f) with k ^ j and x{t) (y~{t) and 
Xk{t) with k 7 ^ j). In this case, we shall say that yj (xj) 
conditionally Granger causes yi (Xi), [12]. In Figure we 
provide an example with m = 6 and n = 1. In particular. 


where S{z) is a sparse transfer matrix by assumption, 
L{z) := FH{z) is a low rank transfer matrix because F 
and {z) are tall matrices, and e(<) := v{t) + w{t) is 
WGN with covariance matrix E = -b FE^jF^. 

Let y{t\t — 1) be the minimum variance one-step ahead 
predictor of y{t) based on the observations y~{t), and 
x{t\t— 1) be the minimum variance estimator of x{t) based 
on From (j^, we have 

y{t\t-l) = Fx{t\t - 1) + S{z)y{t) 
x{t\t-l) = H(z)y{t). (4) 

that is, the predictable part of y is a function of few estimated 
latent variables and of the pasts of few manifest variables of 
y- Eq. 0 can be compactly written as 

y(t|i-l) = S{z)y{t) + L(z)y{t). (5) 

It is worth noting that in the case that L{z) = 0, i.e. there 
is no need of latent variables to characterize the predictor of 
y, we obtain the sparse model presented in [7]. In the case 
S{z) = 0, i.e. the predictor of y is completely characterized 
by the estimators of the latent variables, we obtain a quasi¬ 
static factor model where the noise process is white, see for 
instance [10]. We would like to stress that the decomposition 
of a transfer matrix into sparse plus low rank may not be 
unique. As noticed in [5], this degeneracy may occur when 
L{z) is sparse or S{z) has few null entries. Although this 
nonidentihability issue is important, the aim of this paper is 
to hnd one Sh-L decomposition (see Section |nl| which is not 
necessarily unique. 




III. Gaussian Regression Approach to System 
Identieication 

Consider model and assume that the measured data 
y{l)... y{N) are extracted from a realization of y. The latent 
process x cannot be measured nor its dimension n is known. 
In this Section, we address the problem of estimating S{z) 
and L[z) from the given data. We draw inspiration from the 
Gaussian regression approach proposed in [16]. According 
this method, y is generated by model 

y{t) = G{z)y{t) + e{t) (6) 

where G(z) is a BIBO stable m x m transfer matrix and 
e is WGN with covariance matrix E. Note that, if we set 
G(z) = S(z)+L(z) then (j^ is equivalent to ([^. On the other 
hand, with (|^ we loose the S+L structure we are interest in. 

Since G{z) is BIBO stable, and thus the impulse response 
coefficients decay to zero as a function of the lag index, it 
is possible to use the approximation 

T 

G{z) = Y, Gkz-^ (7) 

k=l 

where T is sufficiently large. The parameters Gk, k = 1, .,T 
of the truncated transfer matrix are stacked in the vector 
6 G which is defined as follows 

0 = [ ... (pM)T|... 

...|(pM)’r ... (8) 

where 

= [G,l, ... [Grh]^ ( 9 ) 

denotes the impulse response coefficients of the transfer 
function in position {i,j) of the truncated approximation of 
G(z). Then, we stack the measured data in the vector y as 
follows 

y = [yi(r+l)T ... yi(iV)T|... 

... I j/^(r+l)T ... y„(7V)T]T.(10) 

In similar way, we define 

e = [ei(r+l)T ... ei(iV)T|... 

...|e™(r+l)T ... e™(Ar)T ]^. ( 11 ) 

From the vector of the measured data can be expressed 
in the linear regression form 

y = $0 + e ( 12 ) 

where $ G jjmAxm t regression matrix. Note that, 

^6 is the one-step ahead predictor of y. 

According to the Gaussian regression framework, 6 is 
modeled as a zero mean Gaussian random vector with 
covariance matrix, or kernel matrix, denoted hy K G S^ 2 rp- 
Let 9 be the posterior mean of 9 given y. In [16] it has 
been proved that, under some technical assumption, 9 can 


also be written as solution to the Tikhonov regularization 
problem 

9 = argmin||y- T>6»|i|_i^j + ||0||^-i. (13) 

e 

Moreover, it is not difficult to see that 

9 = K^^{^K^^ + ll®lN)~^y- (14) 


Remark 3.1: Using the theory of reproducing kernel 
Hilbert spaces, [3], it is possible to show that the above 
results still hold for T —> oo, [16]. 

Remark 3.2: Although we assumed K G S^ 2 rp, the Bayes 
estimator dTO also holds for K singular. In that case. 
Problem (|13|l is well defined provided that 9 belongs to the 
range of K. 

The optimal solution 9 highly depends on the choice of K. 
A typical assumption is that the transfer functions in G(z) are 
independent, that is are independent vectors. We shall 
also assume that with i,j = 1 ... m, are identically 

distributed, so that K = 1^2 (g) K where K G is the 
covariance matrix of with i,j = 1.. .m. In this paper, 
K is chosen as a filtered version of the tuned/correlated (TC) 
kernel, see [16] and [ 6 ] for more details. It is important to 
note this kernel is able to capture high frequency oscillations, 
which are typical of the predictor impulse responses for low 
pass processes, and enforces BIBO stability as T —oo on 
the posterior mean of the predictor impulse responses. 


A. Gaussian Regression Approach to S+L Identification 

The idea is to model S{z) and L{z) through Gaussian 
process, similarly to what has been done above for G(z). 
In particular, since S{z) and L{z) are BIBO stable, we can 
consider their truncated approximations with T is sufficiently 
large. Then, it is not difficult to see that 


y = <^{9i+9,)+e (15) 


where y and e have been defined in ( [T0| i and (ITi. 9st9i G 
K™ T contains the parameters of the truncated approxi¬ 


mations of S(z) and L(z), respectively. Then, we model 
9s and 9i as zero mean random vectors with covariance 
matrix Kg and Kg, respectively. Moreover, we shall assume 
that 9s and 9i are Gaussian and independent. As we will 
see in Section |IV] these assumptions are suggested by the 
maximum entropy principle. 

Proposition 3.1: Let 9s and 9i be, respectively, the poste¬ 
rior mean of 9s and 9i given y. Then, under some technical 
assumption, 9s and 9i are solution to the Tikhonov regular¬ 
ization problem 


argmin||y-$(6»s-f6»i)|lE-i^j„ 




(16) 


Moreover, we have 


9s = Ks<l>^c, 9 i=Kl<1>^c (17) 


where 

c = {^Ks + Kl)^^ + E (g) 


(18) 



In what follows, 6g and 6i will be referred to as posterior 
mean of S{z) and L{z), respectively. In the next Section, we 
shall show how Ks and Kl can be chosen so as to enforce 
BIBO stability on both the posterior mean of S{z) and L{z), 
sparsity on the posterior mean of S{z) as well as low rank 
of the posterior mean of L{z). 

IV. Maximum Entropy Kernel Matrix 


In this Section we characterize the prior probability density 
of 6s and 9i by using the maximum entropy principle. Such 
principle states that among all the prior probability densities 
satisfying certain desired constraints, the optimal one should 
maximize the differential entropy. 

Our starting assumptions are that dg and 0; are absolutely 
continuous zero mean random vectors. Let p{dg,di) denote 
the joint probability density of dg and 9i. Let E denote 
the integration over ^ with respect to the probability 
measure p. Moreover, V denotes the space of probability 
densities which are Lebesgue integrable. The differential 
entropy of p G V is, [9], 


H(p) = -E[log(p(0„0O)]- 


(19) 


Next, we characterize the constraint on dg enforcing BIBO 
stability and sparsity on the posterior mean of S{z). The 
transfer function in position {i,j) of S{z) is the null transfer 
function if and only if sl'-'l is the null vector. We consider 
the constraint 


( 20 ) 


where pij > 0. If ptj = 0, then si®-!! is zero in mean square 
and so also is its posterior mean. Moreover, simple algebraic 
manipulations show that the weighted second moment bound 
in ( [20l l implies a bound on the variance of fc-th element of 
givr^jjich decays as the fc-th element in the main diagonal 
of K. Therefore, condition (201 enforces BIBO stability and 
sparsity on the posterior mean of S{z). 

Regarding the low rank constraint on di, let Ai G 
be the random matrix such that 


Ai=[ Li L2 ... Lt]. (21) 

Consider the constraint 

¥.[Ai{k-^®Ira)Aj]<Q. (22) 

If Q G has m — n singular values equal to zero, then 
the posterior mean of AiAhJ has rank less than or equal to 
n. Therefore, the latter admits the decomposition 

Ai = [ FHi FH2 ... FHt ] , (23) 

where F G and Hk G fc = 1...T, as in 

Section Equivalently, the posterior mean of L{z) admits 
the decomposition L{z) = FH{z). Similarly to the sparse 
part, the weight matrix K~^ 0 enforces BIBO stability 
on the posterior mean of L{z). 

Consider the following maximum entropy problem 

max H(p) 

s.t. E[||s[®-5l||^_J = 

E[AiiK-^®I^)Aj]<Q (24) 


where pij > 0 i,j = 1... m, and Q G 

Theorem 4.1: The optimal solution to (24i is such that 
9g and 9i are independent, Gaussian with zero mean and 
covariance matrix 


Ks=Ti^K, Kl = K®I„ 


K 


(25) 


where 

r = diag( 7 ii... 7 ^ 2 ) e 5+2 (26) 

and A S 5+. 

The matrices L and A are the hyperparameters of the 
kernel matrices Ks and Kl, respectively. Clearly, we are 
interested in the limiting case where pij — 0 for some 
{i,j) and Q low-rank. Let P = {{i,j) s.t Pij = 0} and 
Q = {u G M®®® s.t Qv = 0}. Then, it can be shown that the 
maximum entropy solution can be extended by continuity 
to this limiting case where 7(i_i)m-i-j = 0 if and only if 
{i,j) G P and Au = 0 if and only if G Q. Thus, L tunes 
sparsity on the posterior mean of S(z) and A tunes the rank 
on the posterior mean of L(z). Linally, is worth noting that 
the hyperparameters tuning the decay rate of the posterior 
mean of the predictor impulse responses are encoded in K, 
[17]. 


V. Estimation of the Hyperparameters 
In order to compute k 9g we need to estimate 
the hyperparameters in K and the matrices L and A. The 
hyperparameters describing K are estimated in a preliminary 
step by minimizing the negative log-likelihood of model (j^, 
see [16]. r and A are obtained minimizing the negative log- 
marginal likelihood £ of y. Under some technical assumption, 
we have, [16], 

f(y,r,A) = ^logdetU -I- ^y+U“^y-I-const.term ( 27 ) 
where 

V = ^(Ks + Kl)^^+ S<8 >I^. (28) 

Since ( [27] i is nonconvex in V, only local minima can be 
computed. Beside that, the joint minimization of L and A 
is not trivial because the sparse and low rank part may be 
nonidentifiable from the measured data. Lor this reason, we 
constrain the structure of A as follows: 


A = a(I — (7{7+) -I- U diag(/3i... Pr)U^ 


(29) 


where U G and its columns are the first r singular 

vectors of an estimate AiAJ of AiAj. In this way, the 
constraints in A are decoupled along the “most reliable” r 
singular vectors of AiAJ and their orthogonal complement. 
This is equivalent to fix r latent variables (from the estimate 
AiAj). Regarding the hyperparameter L, in [2] it has been 
shown that the minimization of ( [27] ) leads sparsity in the 
main diagonal of L. Therefore, we minimize (27) with 
respect to ^ = {71 ... 7^2 , a, / 3 i ... jlr} while r and U are 
fixed. The complete procedure to estimate r, U and ^ is 
described in Algorithm and denote, 

respectively, r, U, Ai and ^ at the fc-th iteration. Linally, 
to minimize efficiently ^n\ with respect to f we used the 
scaled gradient projection algorithm developed in [4]. 
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Algorithm 1 Computation of r, C/ and ^ 
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Ks and Kl having hyperparameters given by 
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end if 
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the coefficients of P*‘\z) estimated from 1 16 
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VI. Simulation results 

We consider three Monte Carlo studies of 50 runs where at 
any run a model with m = 6 manifest variables is randomly 
generated. For each run in the Monte Carlo experiments an 
identification data set and a test set, both of size 500, are 
generated. The noise covariance matrix E is always estimated 
via a preliminary step using a low-bias ARX-model, see [11]. 

In the first experiment, the models have McMillan degree 
equal to 20, and are perturbed versions of 0 with I = 1 
latent variable and four non null transfer functions in S{z). 

The second experiment is identical to the first one, with 
the exception that the latent variables are I = 2. 

In the third experiment, the models have McMillan degree 
equal to 30, but without a special structure. 

We compare the following one-step ahead predictors: 

• TRUE: this is the one computed from the true model 

• PEM: this is the one computed from the PEM approach. 


TABLE I 

Average relative complexity oe the S+L Bayesian Network 


Epx. # #1 #2 #3 

AC 55.89 63.72 81.56 


as implemented in pem.m function of the MATLAB 
System Identification Toolbox 

TC: this is the one computed with the approach de¬ 


scribed at the beginning of Section III 


• SL: this is our method described in Section IIII-AI 
The following performance indexes are considered: 

• average relative complexity of the Sh-L network of the 
SL model (in percentage) 


^ 100 ^ #SLk 
50 ^ m'^T 


(30) 


where #SLk is the number of parameters of the esti¬ 
mated Sh-L model at the fc-th run, whereas m?T is the 
number of parameters of a nonstructured model 
• one-step ahead coefficient of determination (in percent¬ 
age) 


COD = 

^^500||y,est(i)-ytest||2 ) 

where y*®®* denotes the sample mean of the test set data 
2 /(1)*®®*... 2/*®®*(500) and y*®®*(f|f — 1) is the one-step 
ahead prediction computed using the estimated model 
• average impulse response fit (in percentage) 

AIRE = 100 f 1 - ^k=i\\Gk-Gk\\ \ 
with G = i ^ Gk. 

Table |I] shows the percentage of the average relative com¬ 
plexity of the Sh-L network. In particular, in the first two 
experiments our method is able to detect that the underlying 
model is close to have a simple Sh-L network. Ligure 
shows the COD in the first two experiments. One can see 
that SL provides a slightly better performance than TC. 
On the other hand, SL provides better estimators for the 
predictor coefficients than the TC, Ligure Linally, Ligure 
1^ shows the COD in the third experiment. The median of 
SL is slightly worse than the one of TC. On the other 
hand, the bottom whisker of SL is better than the one of 
TC. Indeed, SL simplified the Sh-L network, see Table 
increasing the robustness of the estimated predictor impulse 
response coefficients. 

VII. Conclusions 

In this paper, we proposed a Gaussian regression approach 
to identify multivariate stochastic processes having sparse 
network with few latent nodes. Simulations show that our 
approach is able to identify a Sh-L network which does not 
compromise the prediction performance. 
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Fig. 2. One step ahead coefficient of determination in the first experiment 

(left panel) and in the second experiment (right panel). Fig. 4. One step ahead coefficient of determination in the third experiment. 



PEM TC SL PEM TC SL 


Fig. 3. Average impulse response fit in the first experiment (left panel) 
and in the second experiment (right panel). 


References 

[1] A. Abdelwahab, O. Amor, and T. Abdelwahed. The analysis of 
the interdependence structure in international financial markets by 
graphical models. Int. Res. J. Finance Econ., 15:291-306, 2008. 

[2] A. Aravkin, J. Burke, A. Chiuso, and G. Pillonetto. Convex vs non- 
convex estimators for regression and sparse estimation: the mean 
squared error properties of ard and glasso. Journal of Machine 
Learning Research, 15:217-252, 2014. 

[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. 
Soc., 68:337^04, 1950. 

[4] S. Bonettini, A. Chiuso, and M. Prato. A scaled gradient projection 
method for Bayesian learning in dynamical systems. SIAM Journal 
on Scientific Computing, 37:1297-1318, 2015. 

[5] V. Chandrasekaran, P. Pamlo, and A. Willsky. Latent variable 
graphical model selection via convex optimization. Annals of Statistics 
(with discussion), 40(4):1935-2013, Apr. 2010. 

[6] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer 
functions, regularizations and gaussian processes-revisited. Automat- 
ica, 48(8):1525-1535, 2012. 


[7] A. Chiuso and G. Pillonetto. A Bayesian approach to sparse dynamic 
network identification. Automatica, 48(8): 1553-1565, 2012. 

[8] M. Choi, V. Chandrasekaran, and A. Willsky. Gaussian multiresolution 
models: Exploiting sparse markov and covariance structure. IEEE 
Transactions on Signal Processing, 58(3): 1012-1024, March 2010. 

[9] T. Cover and J. Thomas. Information Theory. Wiley, New York, 1991. 

[10] M. Deistler and C. Zinner. Modelling high-dimensional time series 
by generalized linear dynamic factor models: An introductory survey. 
Communications in Information & Systems, 7(2):153-166, 2007. 

[11] G. Goodwin, M. Gevers, and B. Ninness. Quantifying the en'or in 
estimated transfer functions with application to model order selection. 
IEEE Transactions on Automatic Control, 37(7):913-928, Jul 1992. 

[12] C. Granger. Investigating causal relations by econometric models and 
cross-spectral methods. Econometrica, 37(3):424^38, Aug. 1969. 

[13] E Kolaczyk. Statistical analysis of network data: methods and models. 
Springer, 2009. 

[14] S. Lauritzen. Graphical Models. Oxford University Press, Oxford, 
1996. 

[15] L. Ljung. System Identification (2Nd Ed.): Theory for the User. 
Prentice Hall, New Jersey, 1999. 

[16] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error 
identification of linear systems: A nonparametric gaussian regression 
approach. Automatica, 47(2):291-305, 2011. 

[17] G. Pillonetto and G. De Nicolao. A new kernel-based approach for 
lineal* system identification. Automatica, 46:81-93, 2010. 

[18] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Ker¬ 
nel methods in system identification, machine learning and function 
estimation: A survey. Automatica, 50(3):657-682, 2014. 

[19] T. Sdderstrom and P. Stoica, editors. System Identification. Prentice 
Hall, New Jersey, 1988. 

[20] A. Werhli, M. Grzegorczyk, and D. Husmeier. Comparative evalua¬ 
tion of reverse engineering gene regulatory networks with relevance 
networks, graphical gaussian models and Bayesian networks. Bioin¬ 
formatics, 22(20):2523-2531, 2006. 

[21] M. Zorzi and R. Sepulchre. AR identification of latent-vaiiable 
graphical models. Conditionally accepted in IEEE Transactions on 
Automatic Control, 2015. 































































































